Figure 1: MeshAnything converts any 3D representation into Artist-Created Meshes (AMs), i.e., 
meshes created by human artists. It can be combined with various 3D asset production pipelines, 
such as 3D reconstruction and generation, to transform their results into AMs that can be seamlessly 
applied in the 3D industry. 
Abstract 


Recently, 3D assets created via reconstruction and generation have matched the 
quality of manually crafted assets, highlighting their potential to replace them. 
However, this potential is largely unrealized because these assets always need to 
be converted to meshes for 3D industry applications, and the meshes produced 
by current mesh extraction methods are significantly inferior to Artist-Created 
Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh 
extraction methods rely on dense faces and ignore geometric features, leading 
to inefficiencies, complicated post-processing, and lower representation quality. 
To address these issues, we introduce MeshAnything, a model that treats mesh 
extraction as a generation problem, producing AMs aligned with specified shapes. 
By converting 3D assets in any 3D representation into AMs, MeshAnything can 
be integrated with various 3D asset production methods, thereby enhancing their 
application across the 3D industry. The architecture of MeshAnything comprises 
a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a 
mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only 
transformer on this vocabulary for shape-conditioned autoregressive mesh 
generation. Our extensive experiments show that our method generates AMs with 
hundreds of times fewer faces, significantly improving storage, rendering, and 
simulation efficiencies, while achieving precision comparable to previous methods. 


1 Introduction 


In recent years, the 3D community has experienced rapid advancements, with a variety of methods 
developed for automatically producing high-quality 3D assets. These methods, including 3D 
reconstruction [44, 72, 2, 3, 34, 31], 3D generation [47, 39, 64, 41, 56, 30, 57, 69, 66], and scanning 
[17, 26, 27], can produce 3D assets with shape and color quality comparable to manually created 
ones. The success of these methods reveals the potential to replace manually created 3D models with 
automatically produced ones in the 3D industry, including applications in games, movies, and the 
metaverse, significantly reducing time and labor costs. 


However, this potential remains largely unrealized because the current 3D industry predominantly 
relies on mesh-based pipelines for their superior efficiency and controllability, while methods for 
producing 3D assets typically use alternative 3D representations to achieve optimal results across 
various scenarios. Therefore, substantial efforts [42, 16, 43, 51, 12, 53] are devoted to converting 
other 3D representations into meshes and have achieved some success. Meshes produced by these 
methods approximate the shape quality of those created by human artists, which we refer to as 
Artist-Created Meshes (AMs), but they still fall short in addressing the aforementioned issues. 


This is because all meshes produced by these methods [42, 16, 43, 51, 12, 53] exhibit significantly 
poorer topology quality compared to AMs. As shown in Fig. 2, these methods rely on dense faces to 
reconstruct 3D shapes, completely ignoring geometric characteristics. Using these meshes in the 3D 
industry leads to three significant problems: First, converted meshes typically contain several orders 
of magnitude more faces compared to AMs, leading to significant inefficiencies in storage, rendering, 
and simulation. Moreover, the converted meshes complicate post-processing and downstream tasks 
in the 3D pipeline. They significantly increase the challenge for human artists in optimizing these 
meshes due to their chaotic and inefficient topologies. Finally, previous methods struggle to represent 
sharp edges and flat surfaces, resulting in oversmoothing and bumpy artifacts as shown in Fig. 2. 


In this work, we aim to solve the aforementioned issues to facilitate the application of automatically 
generated 3D assets in the 3D industry. As mentioned earlier, all previous methods [42, 16, 43, 51, 
12, 53] extract 3D meshes with excessively dense faces in a reconstruction manner, which inherently 
cannot solve these issues. Therefore, we diverge from previous approaches by formulating mesh 
extraction as a generation problem for the first time: we teach models to generate Artist-Created 
Meshes (AMs) that are aligned with the given 3D assets. The meshes generated by our methods 
mimic the shape and topology quality of those created by human artists. Consequently, our setting, 
namely Shape-Conditioned AM Generation, is fundamentally free from all previous issues, enabling 
seamless integration of the generated results into the 3D industry pipeline. 
Figure 2: Comparison with Marching Cubes [42] and Remesh [4]. We apply Marching Cubes 
and MeshAnything to ground truth shapes and then apply remeshing to the Marching Cubes results 
with different voxel sizes. Existing methods extract meshes in a reconstruction manner, ignoring the 
geometric features of the object and producing dense meshes with poor topology. These methods 
fundamentally fail to capture sharp edges and flat surfaces, as shown in the zoomed-in figure. 


However, training such a model presents significant challenges. The first challenge is constructing 
the dataset, as we need paired shape conditions and Artist-Created Meshes (AMs) for model training. 
The shape condition must be efficiently derived from as many diverse 3D representations as possible 
to serve as a condition during inference. Additionally, it must have sufficient precision to accurately 
represent 3D shapes and be efficiently processed into features that can be injected into the model. After 
weighing the trade-offs, we chose point clouds due to their explicit and continuous representation, ease 
of derivation from most 3D representations, and the availability of mature point cloud encoders [48, 
49, 75]. 


We filter out high-quality AMs from Objaverse [19, 18] and ShapeNet [7]. When obtaining paired 
shape conditions, a naive approach would be to sample point clouds directly from AMs. However, 
this leads to poor results during inference because the sampled point clouds have excessive precision, 
while automatically produced 3D assets cannot provide point clouds of similar quality, causing a 
domain gap between training and inference. To address this issue, we intentionally corrupt the shape 
quality of AMs. We first extract the signed distance function from AMs [63], convert it into a coarser 
mesh using [42], and then sample point clouds from this coarse mesh to narrow the domain gap in 
shape conditions between inference and training. 


Following [55], we use a VQ-VAE [61] to learn a mesh vocabulary and train a decoder-only transformer 
[62] on this vocabulary for mesh generation. To inject the shape condition, we draw inspiration 
from the recent success of multimodal large language models (MLLMs) [68, 37], where image features 
encoded by pre-trained image encoders are projected into the token space of the large language 
models for efficient multimodal understanding. Similarly, we treat the mesh tokens obtained from 
the trained VQ-VAE as the language tokens in LLMs and use a pre-trained encoder [75] to encode 
the point clouds into shape features, which are later projected into the mesh token space. These shape 
tokens are placed at the beginning of the mesh token sequences, effectively serving as the shape 
conditions for next-token predictions. After predictions, these predicted mesh tokens are decoded 
back to meshes with the VQ-VAE decoder [55]. 


To further enhance the quality of mesh generation, we develop a novel noise-resistant decoder for 
robust mesh decoding. Our observation is that as the decoder in the VQ-VAE [61] is only trained 
with ground truth token sequences from the encoder, it could potentially lead to a domain gap when 
decoding the generated token sequences. To mitigate this problem, we inject the shape condition into 
the VQ-VAE decoder as auxiliary information for robust decoding and fine-tune it after the VQ-VAE 
training. This fine-tuning process involves adding noise to the mesh token sequences to simulate 
possible poor-quality token sequences from the decoder-only transformer, thus making the decoder 
robust to such poor-quality sequences. 


Finally, we introduce our model, MeshAnything, trained based on the aforementioned techniques. 
As shown in Fig. 1, MeshAnything can convert 3D assets across various 3D representations into 
AMs, thereby significantly facilitating their application. Furthermore, our extensive experiments 
demonstrate that our method generates AMs with significantly fewer faces and more refined topology, 
while achieving precision metrics that are close to or comparable with previous methods. 


In summary, our contributions are threefold: 


• We highlight one important reason why current automatically produced 3D assets cannot 
replace those created by human artists: current methods cannot convert these 3D assets 
into Artist-Created Meshes (AMs). To solve this issue, we propose a novel solution called 
Shape-Conditioned AM Generation, which aims to generate AMs aligned with given shapes. 

• We introduce MeshAnything for Shape-Conditioned AM Generation. MeshAnything can be 
integrated with various 3D asset production methods, converting their results into AMs to 
facilitate their application in the 3D industry. 

• We develop a novel noise-resistant decoder to enhance mesh generation quality. We inject the 
shape condition into the decoder as auxiliary information for robust decoding and fine-tune 
it using noised token sequences to narrow the domain gap between training and inference. 


2 Related Works 


2.1 Mesh Extraction 


Methods for extracting meshes from 3D models are numerous and have been a subject of research for 
decades. Following [53], we categorize these methods into two main types: Isosurface Extraction [42, 
5, 16, 6, 43, 12] and Gradient-Based Mesh Optimization [9, 23, 28, 33, 52, 36, 53]. 


Traditional isosurface extraction methods [42, 43, 16, 21, 32, 50, 13, 12] focus on extracting a 
polygonal mesh that represents the level set of a scalar function, an area that has seen extensive study 
in various fields. The most popular method among them is Marching Cubes [42]. It divides the space 
into cells, within which polygons are created to approximate the surface. Marching Cubes has been 
widely used for mesh extraction due to its robustness and simplicity. To more effectively capture sharp 
features, Dual Contouring [50] adopts a dual representation by extracting mesh vertices per cell and 
estimates vertex positions based on local isosurface details. Dual Marching Cubes [46] is another 
advanced approach that combines the advantages of both Marching Cubes and Dual Contouring. 
Recently, [13] and [12] introduced data-driven methods to determine the position of the extracted 
mesh based on the input field. 


Transitioning to more recent developments, the advent of machine learning has ushered in new 
techniques for generating 3D meshes [9, 23, 28, 33, 52, 36, 53]. This line of work explores using 
neural networks to generate 3D meshes, where the network parameters are optimized through 
gradient-based methods under specific loss functions. These approaches leverage the computational 
power of machine learning to refine mesh quality and adaptability, making them suitable for more 
complex geometrical structures that traditional methods may struggle to process efficiently. [23] 
proposes predicting a deformable tetrahedral grid for 3D shape representation. Besides, [52] employs 
a differentiable Marching Tetrahedra layer for mesh extraction. Similar to [52], [53] iteratively 
optimizes a 3D surface mesh by representing it as the isosurface of a scalar field. 


However, these approaches fundamentally differ from ours. They ignore the characteristics of the 
shape and inherently cannot produce meshes with efficient topology. In contrast, Shape-Conditioned 
AM Generation formulates mesh extraction as a generation problem for the first time, aiming to 
mimic human artists in mesh extraction and thereby generating Artist-Created Meshes (AMs) with 
hundreds of times fewer faces. 


2.2 3D Mesh Generation 


3D mesh generation can be mainly divided into two categories: generating dense meshes similar to 
those produced by previous mesh extraction methods, and generating Artist-Created Meshes (AMs). 


The former category is currently the mainstream research focus. Methods such as [24, 66, 69] directly 
generate meshes in a feed-forward manner, but because they produce dense meshes with low-quality 
topology similar to previous mesh extraction methods, they still encounter the same issues when 
applied in the 3D industry. The most popular techniques in this category are LRM-based mesh 
generation models [30, 66, 69], which utilize transformers to generate 3D meshes in a feed-forward 
manner, often conditioning on images for control. 


Notably, numerous 3D generation methods [47, 59, 64, 11, 58, 71, 30, 22, 10, 38, 54, 35, 14, 15, 57, 
65, 60] can also produce meshes. These methods do not directly generate meshes but instead produce 
3D assets in other representations. In practice, these 3D assets often need to be converted into meshes 
using mesh extraction methods, establishing a connection to our approach. Similar to direct mesh 
generation methods, these approaches primarily focus on the generation of shape and color while 
ignoring the topology of the mesh, resulting in dense meshes. Consequently, they face challenges 
when applied to the 3D industry due to their inefficient topology. 


Recently, several works have focused on the second category: generating Artist-Created 
Meshes (AMs) [45, 1, 55, 8]. Although our approach also focuses on AM generation, it fundamentally 
differs from these methods. Since they lack shape conditioning, these methods must 
simultaneously learn the complex 3D shape distribution, which alone typically requires extensive 
training [30, 57], and the topology distribution of AMs, leading to very challenging training 
processes. In contrast, Shape-Conditioned AM Generation only needs to learn how to construct 
efficient topology for a known shape, making the learning process significantly easier. Numerous 
mature methods [44, 34, 30, 57, 2, 3, 66, 69] already effectively produce high-quality 3D shapes; by 
combining our approach with these methods, we can achieve similar outcomes to those of direct AM 
generation methods. This will be detailed in Sec. 3. 


Among these methods, the most relevant to ours is MeshGPT [55], as we follow its architecture. [55] 
introduced a combination of a VQ-VAE [61] and an autoregressive transformer architecture. It first 
learns a mesh vocabulary with the VQ-VAE and then trains the transformer on the learned vocabulary 
for mesh generation. However, MeshGPT’s results are limited to several categories in ShapeNet. 
MeshGPT requires a number of training GPU hours similar to ours, yet our method generalizes across 
Objaverse, which has no category limitations. This is largely due to the difference in target 
complexity: MeshGPT must additionally learn the complex 3D shape distribution. 


3  Shape-Conditioned AM Generation 


In this section, we first introduce the formal formulation for Shape-Conditioned AM Generation and 
compare it with previous mesh generation settings [45, 55, 1]. We show that it can achieve the same 
targets as previous mesh generation methods with significantly less training effort. 


Shape-Conditioned AM Generation aims to estimate a conditional distribution p(M|S). In this 
formula, M refers to the Artist-Created Mesh (AM), i.e., the mesh manually modeled by human 
artists. S refers to the 3D shape information that indicates the 3D shape to which M should align. 
The input form of S can be diverse, such as voxels or point clouds. This versatility allows 
our method to be integrated with any 3D pipeline that outputs S, such as 3D reconstruction [44, 34], 
generation [47, 30], and scanning, making these methods more efficient for the 3D industry. 


In contrast, existing AM generation works directly estimate the distribution p(M|C), where 
C denotes conditions such as images, text, or the empty set for unconditional generation. However, 
estimating p(M|C) requires an understanding of both the underlying shape, i.e., S, and the complex 
topological structure of M. Given this, we make the following approximation: 


$$p(M \mid C) \approx p(M, S \mid C). \tag{1}$$
Figure 3: Pipeline Overview. We introduce MeshAnything, an autoregressive transformer capable 
of generating Artist-Created Meshes that adhere to given 3D shapes. We sample point clouds from 
given 3D assets, encode them into features, and inject them into the decoder-only transformer to 
achieve shape-conditional mesh generation. 
According to the chain rule, we have: 
$$p(M, S \mid C) = p(M \mid S, C) \cdot p(S \mid C). \tag{2}$$


For distribution p(M|S,C), given that S is a much stronger and more direct condition than C, we 
can make the following approximation: 


$$p(M \mid S, C) \approx p(M \mid S). \tag{3}$$


Combining Eqs. (1), (2), and (3): 

$$p(M \mid C) \approx p(M \mid S) \cdot p(S \mid C), \tag{4}$$
in which p(M|S) is the focus of our shape-conditioned mesh generation. In the 3D community, 
numerous large models [30, 57, 69, 55] aim to estimate p(S|C) using various 3D representations and 
demonstrate excellent results. Besides, some single scene 3D asset production methods [44, 34, 2, 3, 
47, 40, 56] can also provide samples from this distribution. By integrating our framework with these 
existing methods, we can leverage their capabilities to enhance our mesh generation process. This 
integration allows for a more resource-efficient way to estimate p(M|C), significantly reducing the 
complexity and resources required compared to previous methods. 
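
One way to read approximations (1) to (4), offered as an interpretive sketch rather than a derivation taken from the paper: marginalizing over shapes is exact, and the only modeling approximation is replacing p(M|S,C) with p(M|S). The symbol $\hat{S}$ and the two-stage sampling statement on the last line are assumptions introduced here for illustration; they describe how an existing model of p(S|C) could be chained with a model of p(M|S).

```latex
\begin{align*}
p(M \mid C) &= \int p(M, S \mid C)\, dS
  && \text{(marginalization over shapes, exact)} \\
            &= \int p(M \mid S, C)\, p(S \mid C)\, dS
  && \text{(chain rule, Eq. 2)} \\
            &\approx \int p(M \mid S)\, p(S \mid C)\, dS
  && \text{(Eq. 3)} \\[4pt]
\hat{S} \sim p(S \mid C),\;\; M \sim p(M \mid \hat{S})
  \;&\Longrightarrow\;
  M \sim \int p(M \mid S)\, p(S \mid C)\, dS
  && \text{(two-stage sampling, assumed)}
\end{align*}
```

Under this reading, any pipeline that can produce a shape sample from a condition C supplies the first stage, and the shape-conditioned model supplies the second.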


4 Method 


In this section, we first detail our shape condition strategy (Section 4.1). We then describe 
MeshAnything, which consists of a VQ-VAE with our newly proposed noise-resistant 
decoder (Section 4.2) and a shape-conditioned autoregressive transformer (Section 4.3). 


4.1 Shape Encoding for Conditional Generation 


We begin by describing our shape condition strategy. MeshAnything targets learning p(M|S), so we 
need to pair each mesh M with a corresponding S, i.e., the shape condition. Choosing an appropriate 
3D representation for S is non-trivial and should satisfy the following conditions: 


1. It should be easily extracted from various 3D representations. This ensures that the trained 
models can be integrated with a wide range of 3D asset production pipelines [44, 34, 30, 47, 
57]. 

2. It should be suitable for data augmentation to prevent overfitting. To ensure the effectiveness 
of S during training, any data augmentation applied to M must be equivalently applicable 
to S. 


3. It should be efficiently and conveniently input into the model as a condition. To ensure the 
model comprehends the shape information and to maintain efficient training, S must be 
easily and effectively encoded into features. 


Considering the first and second points, S should be in an explicit representation. Further considering 
the third point, the main explicit 3D representations that can be easily encoded as features are voxels 
and point clouds. Both representations are suitable, but voxels typically require a high resolution to 
accurately represent shapes, and processing high-resolution voxels into features is computationally 
expensive. Additionally, voxels, being a discrete representation, are less precise under data augmentation 
than point clouds. Therefore, we chose point clouds as the representation for S. To enhance 
the expressive power of the point clouds, we also include normals in the point cloud representation. 


Figure 4: Additional qualitative results of MeshAnything. As shown, MeshAnything can be 
integrated with various 3D production pipelines to achieve highly controllable mesh generation. 


To obtain point clouds from the ground truth mesh for training, we could simply sample point clouds 
directly from the surface of M. However, this would create problems during inference: the surfaces 
of automatically generated 3D assets are often rougher than those of AMs. For example, in AMs, we 
would sample a series of points on a flat plane, whereas automatically generated 3D assets would 
have uneven surfaces, causing a domain gap between training and inference. 


Therefore, we need to ensure that S extracted from the ground truth M during training has a similar 
domain to the S extracted during inference. To bring their domains closer, we intentionally construct 
coarse meshes from AMs. We first extract the signed distance function from M with [63], then 
convert it into a relatively coarse mesh using Marching Cubes [42] to destroy the ground truth 
topology. Finally, we sample a point cloud and its normals from the coarse mesh. This approach also 
helps to avoid overfitting: AMs typically have few faces, so several points are often sampled from 
the same face. If the points were sampled directly from M, the network could easily recognize the 
ground truth topology by determining whether the points lie on the same plane. 


Since almost all 3D representations can be converted into a coarse mesh using Marching Cubes [42] 
or sampled into point clouds, this ensures that the domain of S is consistent during both training 
and inference. We pair the point clouds extracted as S with M to form the training data 
$\{(M_i, S_i)\}_i$. 
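
The last step of this pipeline, sampling points and normals from a triangle mesh, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the coarse mesh has already been produced (the signed distance extraction and Marching Cubes steps are not reproduced), and the function name `sample_points_with_normals` and the area-weighted uniform sampling scheme are choices made here for illustration.

```python
import math
import random


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sample_points_with_normals(vertices, faces, n_points, seed=0):
    """Sample (point, normal) pairs uniformly over the surface of a triangle mesh.

    Hypothetical helper: `vertices` is a list of (x, y, z) tuples and `faces`
    is a list of vertex-index triples.
    """
    rng = random.Random(seed)
    areas, normals = [], []
    for i, j, k in faces:
        n = _cross(_sub(vertices[j], vertices[i]), _sub(vertices[k], vertices[i]))
        length = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
        areas.append(0.5 * length)
        # Degenerate (zero-area) faces get a zero normal and are never chosen.
        normals.append(tuple(c / length for c in n) if length > 0 else (0.0, 0.0, 0.0))

    # Pick faces with probability proportional to their area.
    chosen = rng.choices(range(len(faces)), weights=areas, k=n_points)

    samples = []
    for f in chosen:
        a, b, c = (vertices[idx] for idx in faces[f])
        # Square-root trick for uniform barycentric coordinates on a triangle.
        s = math.sqrt(rng.random())
        r = rng.random()
        w0, w1, w2 = 1.0 - s, s * (1.0 - r), s * r
        point = tuple(w0 * a[d] + w1 * b[d] + w2 * c[d] for d in range(3))
        samples.append((point, normals[f]))
    return samples
```

Because every sample from one face shares the same normal, sampling directly from an AM would expose its face structure, which is the leak that the coarse intermediate mesh is meant to remove.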


4.2 VQ-VAE with Noise-Resistant Decoder 


Following MeshGPT [55], we first train a VQ-VAE [61] to learn a vocabulary of geometric embeddings 
for better transformer [62] learning. Unlike MeshGPT, which uses graph convolutional 
networks [67] and ResNet [29] as the encoder and decoder respectively, we employ transformers with 
identical structures for both the encoder and decoder. When training the VQ-VAE, meshes are discretized 
and input as a sequence of triangle faces: 

$$M := (f_1, f_2, \ldots, f_N), \tag{5}$$
where $f_i$ contains the coordinates of the vertices of the $i$-th face, and $N$ is the number of faces in $M$. The 
encoder $E$ then extracts a feature vector for each face: 

$$Z = (z_1, z_2, \ldots, z_N) = E(M), \tag{6}$$
where $z_i$ is the feature vector for $f_i$. 


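The extracted features are then quantized into quantized features $\mathcal{T}$ with codebook $\mathcal{B}$ 
using residual quantization (RQ): 
$$\mathcal{T} = \mathrm{RQ}(Z; \mathcal{B}). \tag{7}$$
Finally, the reconstructed mesh is decoded from $\mathcal{T}$ with decoder $D$ by predicting the logits for each 
vertex’s coordinates: 

$$\hat{M} = D(\mathcal{T}). \tag{8}$$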
Figure 5: Qualitative Results. (a) further demonstrates our capability to achieve highly controllable 
mesh generation when combined with 3D asset production pipelines. Besides, we compare our 
results with ground truth in (b) and (c). In (b), MeshAnything generates meshes with better topology 
and fewer faces than the ground truth. In (c), we produce meshes with a completely different topology 
while achieving a similar shape, proving that our method does not simply overfit but understands how 
to construct meshes using efficient topology. 


The VQ-VAE is trained end-to-end with cross-entropy loss on the predicted vertex coordinate logits 
and the commitment loss of vector quantization [61]. After the VQ-VAE is trained, its encoder and 
decoder are treated as a tokenizer and detokenizer for autoregressive transformer training. 
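
The residual quantization step can be illustrated with a toy sketch. This is an assumption-laden example, not the paper's implementation: it uses one separate codebook per level (the text does not say whether codebooks are shared), tiny 2-D vectors instead of learned face features, and hypothetical function names `nearest_code` and `residual_quantize`.

```python
def nearest_code(codebook, vec):
    """Return the index of the codebook entry closest to `vec` (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize `vec` level by level; each level encodes the previous residual.

    Returns the chosen index per level and the summed reconstruction.
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:  # one (assumed separate) codebook per depth level
        idx = nearest_code(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


# Three levels (the paper reports depth 3), each a coarse-to-fine set of axis steps.
axis_steps = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
toy_codebooks = [
    [(scale * x, scale * y) for x, y in axis_steps] for scale in (1.0, 0.5, 0.25)
]
toy_indices, toy_recon = residual_quantize((0.9, -0.3), toy_codebooks)
```

Each level contributes one discrete index, so a feature quantized at depth 3 is represented by three tokens from the learned vocabulary.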


However, as shown in Fig. 6, there are possible imperfections in the generation results. To address this 
issue, given our setting of Shape-Conditioned AM Generation, the VQ-VAE decoder can also take the 
shape condition as input. Small imperfections in the token sequences generated by the transformer can 
potentially be corrected by a shape-aware decoder. Therefore, after completing the vanilla VQ-VAE 
training, we add an additional decoder fine-tuning stage, where we inject the shape information into 
the transformer decoder. Then we add random Gumbel noise to the codebook sampling logits to 
simulate the potential imperfections in the token sequences generated by the transformer during 
inference. The decoder is then updated independently with the same cross-entropy loss to train it to 
produce refined meshes even when facing imperfect token sequences. Our experiments in Tab. 1 and 
Tab. 2 show that our method effectively enhances the decoder’s noise resistance and mesh generation 
quality. 
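
The noise injection used during decoder fine-tuning can be sketched as below. This is a hedged illustration only: the text does not define how the codebook sampling logits are computed, so the sketch assumes they are negative distances to the codebook entries, and `noisy_code_index` and `noise_level` are hypothetical names.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def noisy_code_index(distances, noise_level, rng):
    """Pick a codebook index after perturbing the logits with scaled Gumbel noise.

    Assumption: logits are the negative distances to each codebook entry.
    With noise_level == 0 this reduces to ordinary nearest-entry quantization.
    """
    logits = [-d for d in distances]
    noisy = [logit + noise_level * sample_gumbel(rng) for logit in logits]
    return max(range(len(noisy)), key=noisy.__getitem__)
```

Scaling the noise controls how often the selected code deviates from the nearest entry, which is how the simulated token sequences are made more or less imperfect.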


4.3 Shape-Conditioned Autoregressive Transformer 


To add the shape condition to the transformer, inspired by the success of multimodal large language 
models [68, 37, 70, 25], we first encode the point cloud into a fixed-length token sequence with 
a point cloud encoder $P$ and then concatenate it to the front of the embedding sequence $\mathcal{T}$ from the 
VQ-VAE as the final input embedding sequence for the transformer: 


$$\mathcal{T}' = \mathrm{concat}(P(S), \mathcal{T}), \tag{9}$$


where $\mathcal{T}'$ is the training input for the transformer. 


We borrow a pretrained point encoder from [75] and add a linear projection layer to project its 
output feature to the same latent space as $\mathcal{T}$. During training, the original point encoder from [75] 
is frozen; we only update the newly added projection layer and the autoregressive transformer with 
cross-entropy loss. 


During inference, we input $P(S)$ to the transformer and require it to generate the subsequent sequence 
$\hat{\mathcal{T}}$, which is then input to the noise-resistant decoder to reconstruct meshes: 


$$\hat{M} = D(\hat{\mathcal{T}}), \tag{10}$$


where $\hat{M}$ is the final generated AM. 


We use the standard next-token prediction loss to train shape-conditioned transformers. For each 
sequence, we add a <bos> token after the point cloud tokens and a <eos> token after the mesh tokens 
to identify the end of a 3D mesh. 
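
The sequence layout described above can be sketched as follows. This is an illustrative assumption, not the authors' training code: shape tokens are shown as placeholder strings although they are continuous features in practice, and masking the prefix positions out of the loss is a choice made here because the text does not state how those positions are handled.

```python
BOS, EOS = "<bos>", "<eos>"
IGNORE = None  # marks positions that contribute nothing to the loss (assumed)


def build_training_sequence(shape_tokens, mesh_tokens):
    """Lay out [shape tokens] <bos> [mesh tokens] <eos> and next-token targets."""
    inputs = list(shape_tokens) + [BOS] + list(mesh_tokens) + [EOS]
    # Position t is trained to predict inputs[t + 1]; the last position has no target.
    targets = inputs[1:] + [IGNORE]
    # Positions inside the shape prefix would predict shape tokens or <bos>,
    # so they are excluded from the loss in this sketch.
    for t in range(len(shape_tokens)):
        targets[t] = IGNORE
    return inputs, targets


example_inputs, example_targets = build_training_sequence(["s0", "s1"], [5, 7, 9])
```

At inference time only the shape prefix and `<bos>` are supplied, and generation stops once `<eos>` is produced.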
Figure 6: Ablation on Noise-Resistant Decoder. The decoder-only transformer may generate 
low-quality token sequences, and the decoder of VQ-VAE would typically produce flawed meshes 
based on these sequences. In contrast, our Noise-Resistant Decoder, aided by shape conditions, has 
the ability to resist these low-quality token sequences, producing higher-quality meshes. 


Table 1: Reconstruction Performance under Different Noise Levels with and without Noise-Resistant 
(NR) Decoder. Please refer to Section 5.4 for metrics explanation. 


              CD (×10^-2) ↓       ECD (×10^-2) ↓      NC ↑
Noise Level   W/O NR    W/ NR     W/O NR    W/ NR     W/O NR    W/ NR
0.0           0.011     0.007     0.035     0.023     0.987     0.993
0.1           0.187     0.028     0.613     0.138     0.973     0.991
0.5           1.167     0.639     2.538     1.329     0.964     0.981
1.0           2.131     1.798     4.317     2.316     0.952     0.969


5 Experiments 


5.1 Data Preparation 


Data Selection. Existing AM generation works are limited to a few categories. In contrast, our method 
aims to operate on general shapes. MeshAnything is trained on a combined dataset of Objaverse [19] 
and ShapeNet [7], selected for their complementary characteristics. We chose Objaverse because it 
contains a large number of AMs without category limitations. On the other hand, ShapeNet offers 
higher data quality within limited categories. 


We filter out meshes with more than 800 faces from both datasets. Additionally, we manually filter 
out low-quality meshes. Our final filtered dataset consists of 51k meshes from Objaverse and 5k 
meshes from ShapeNet. We randomly select 10% of this dataset as the evaluation dataset, with the 
remaining 90% used as the training set for all our experiments. 


Data Processing and Augmentation. Following the strategies of PolyGen [45] and MeshGPT [55], 
we order faces by their lowest vertex index, then by the next lowest, and so on. Vertices are sorted 
in ascending order based on their z-y-x coordinates, where z represents the vertical axis. Within 
each face, we permute the indices to ensure the lowest index comes first. During training, we apply 
on-the-fly scaling, shifting, and rotation augmentations, normalizing each mesh to a unit bounding 
box from -0.5 to 0.5. 
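
The ordering convention above can be sketched as follows. This is a minimal illustration under stated assumptions: vertices are (x, y, z) tuples with z as the vertical axis, and the within-face permutation is implemented as a cyclic rotation so that the face orientation is preserved, which the text does not specify. The name `canonicalize_mesh` is hypothetical.

```python
def canonicalize_mesh(vertices, faces):
    """Sort vertices by (z, y, x), rotate each face to start at its lowest
    index, and order faces by their sorted vertex indices."""
    order = sorted(
        range(len(vertices)),
        key=lambda i: (vertices[i][2], vertices[i][1], vertices[i][0]),
    )
    remap = {old: new for new, old in enumerate(order)}
    sorted_vertices = [vertices[i] for i in order]

    new_faces = []
    for face in faces:
        idx = [remap[i] for i in face]
        k = idx.index(min(idx))
        # Cyclic rotation (assumed) keeps the winding direction of the face.
        new_faces.append(tuple(idx[k:] + idx[:k]))

    # Order faces by lowest index, then next lowest, and so on.
    new_faces.sort(key=lambda f: tuple(sorted(f)))
    return sorted_vertices, new_faces


demo_vertices = [(1, 0, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
demo_faces = [(0, 2, 3), (3, 1, 2)]
canon_vertices, canon_faces = canonicalize_mesh(demo_vertices, demo_faces)
```

A fixed ordering of this kind gives every mesh a single canonical token sequence, which is what makes next-token prediction over faces well defined.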


5.2 Implementation Details 


The encoder and decoder of VQ-VAE both use the encoder of BERT [20], while we choose OPT-350M 
[74] as our autoregressive transformer architecture. The residual vector quantization [73] depth 
is set to 3, with a codebook size of 8,192. 


Our point encoder is based on the pretrained point encoder from [75], which has been trained on 
Objaverse and thus can handle general shapes. This point encoder outputs a fixed-length token 
sequence of 257 tokens, with 256 tokens primarily containing shape information and an additional 
head token containing semantic information about the shape. We sample 4096 points for each point 
cloud. 


The VQ-VAE is trained on 8 A100 GPUs for 12 hours and the transformer is trained on 8 A100 GPUs 
for 4 days. 


Table 2: Ablation on Noise-Resistant (NR) Decoder for the Quality of Mesh Generation. Please refer 
to Section 5.4 for metrics explanation. 


Method    CD ↓        ECD ↓       NC ↑
          (×10^-2)    (×10^-2)
W/O NR    2.423       6.414       0.883
W/ NR     2.256       6.245       0.902


Table 3: Quantitative evaluation with Marching Cubes [42] and Remesh [4]. Please refer to Section 5.4 for 
metrics explanation. 


Method               CD ↓       ECD ↓      NC ↑     #V ↓      #F ↓      V_Ratio ↓   F_Ratio ↓
                     (×10^-2)   (×10^-2)            (×10^3)   (×10^3)
(a) Marching Cubes   1.532      6.733      0.954    73.22     146.0     440.2       462.2
(b) Remesh (0.005)   2.174      7.813      0.912    127.8     167.9     748.1       534.6
(c) Remesh (0.010)   2.083      7.578      0.929    39.01     41.78     225.4       132.3
(d) Remesh (0.030)   2.915      8.329      0.863    5.848     4.410     34.38       14.05
(e) Remesh (0.050)   4.179      8.138      0.814    2.299     1.538     13.64       4.920
(f) Remesh (0.100)   7.312      10.771     0.748    0.625     0.359     3.735       1.149
(g) MeshAnything     2.256      6.245      0.902    0.172     0.318     0.888       0.871


5.3 Qualitative Experiments 


We present more qualitative results of our model combined with other 3D asset production pipelines. 
As shown in Fig. 1, Fig. 4, and Fig. 5, MeshAnything effectively generates AMs from various 
3D representations. When integrated with different 3D asset production pipelines, our method 
effectively achieves mesh generation with diverse conditions. 


Next, Fig. 5 demonstrates that our model does not simply overfit but understands how to generate 
meshes with efficient topology that conform to the given shape. To prove this, we use manually 
created meshes as ground truth and use their shapes as conditions to test whether our model can 
generate meshes with comparable topology. To effectively use the ground truth as conditions, we 
first convert them into dense meshes using Marching Cubes [42] to disrupt their face structure. Then, 
we sample point clouds with normals from the dense meshes to serve as shape conditions. The 
experimental results in Fig. 5 show that MeshAnything is capable of generating meshes comparable 
to or even surpassing those modeled by human artists, exhibiting diverse and strong 3D modeling 
capabilities. 


5.4 Quantitative Experiments 


Metrics. We follow the evaluation metric setting of [12]. We quantitatively evaluate mesh quality by 
uniformly sampling 100K points from the faces of both the ground truth meshes and the predicted 
meshes, and then computing a set of metrics to assess various aspects of the reconstruction. We 
report the following metrics: Chamfer Distance (CD) to evaluate the overall quality of a reconstructed 
mesh; Edge Chamfer Distance (ECD) to assess the preservation of sharp edges by sampling points 
near sharp edges and corners, and Normal Consistency (NC) to evaluate the quality of the surface 
normals. Additionally, we report the number of mesh vertices (#V) and the number of mesh faces 
(#F). We also provide the ratio of the estimated number of vertices to the ground truth number of 
vertices (#V_Ratio) and the same ratio for faces (#F_Ratio). 
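
To make the metrics above concrete, the following sketch computes a Chamfer Distance, a Normal Consistency score, and a count ratio on small point sets. It is an illustration under assumptions, not the exact protocol of [12]: the use of squared distances, the symmetric averaging, the absolute value in the normal dot product, and the helper names `chamfer_distance`, `normal_consistency`, and `count_ratio` are all choices made here for clarity. The brute-force nearest-neighbor search is only suitable for small inputs; 100K-point evaluations would need a spatial index.

```python
import math


def _nearest(p, cloud):
    """Index and squared distance of the point in `cloud` closest to `p`."""
    best_i, best_d = -1, math.inf
    for i, q in enumerate(cloud):
        d = sum((p[k] - q[k]) ** 2 for k in range(3))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def chamfer_distance(pred, gt):
    """Symmetric CD: mean squared nearest-neighbor distance, both directions."""
    pred_to_gt = sum(_nearest(p, gt)[1] for p in pred) / len(pred)
    gt_to_pred = sum(_nearest(q, pred)[1] for q in gt) / len(gt)
    return pred_to_gt + gt_to_pred


def normal_consistency(pred, pred_n, gt, gt_n):
    """Mean |dot| between each unit normal and its nearest neighbor's normal."""

    def one_way(src, src_n, dst, dst_n):
        total = 0.0
        for p, n in zip(src, src_n):
            j, _ = _nearest(p, dst)
            total += abs(sum(n[k] * dst_n[j][k] for k in range(3)))
        return total / len(src)

    return 0.5 * (one_way(pred, pred_n, gt, gt_n) + one_way(gt, gt_n, pred, pred_n))


def count_ratio(pred_count, gt_count):
    """Ratio of predicted to ground truth element counts (vertices or faces)."""
    return pred_count / gt_count
```

A ratio below 1 from `count_ratio` means the predicted mesh uses fewer vertices or faces than the ground truth.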


Ablations on Noise-Resistant Conditional Decoder. We first perform ablation experiments to verify 
the effectiveness of the Noise-Resistant Decoder. We begin with a VQ-VAE trained without any noise 
or conditioning. We then perform ablation between two settings: one where the decoder remains 
unchanged and unaware of the shape condition, and another where the shape condition is injected 
into the transformer, as described in Section 4.2. Next, we randomly sample noise from a Gumbel 
distribution and add it to the codebook sampling logits during the vector quantization process to simulate 
the potentially low-quality token sequences generated by the transformer. We control the noise level by 
scaling the added noise. 
After training both models for enough epochs, we test their performance under the same level of noise. 
As shown in Tab. 1, as the intensity of the added noise increases, the Noise-Resistant Decoder with 
shape condition clearly achieves better reconstruction results. This indicates that the shape condition 
helps the decoder identify and correct imperfections in the input token sequences. 
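
The noise injection used in this ablation can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the codebook sampling logits are negative squared distances between a latent vector and each codebook entry, and the names `gumbel_noise`, `noisy_quantize`, and `noise_scale` are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def noisy_quantize(latent, codebook, noise_scale, rng):
    """Return the codebook index chosen after perturbing the logits."""
    # Assumption: logits are negative squared distances to each code.
    logits = [
        -sum((x - c) ** 2 for x, c in zip(latent, code)) for code in codebook
    ]
    # Scaling the Gumbel noise controls how often the wrong token is picked.
    noisy = [logit + noise_scale * gumbel_noise(rng) for logit in logits]
    return max(range(len(codebook)), key=noisy.__getitem__)
```

With `noise_scale` set to 0 the function reduces to ordinary nearest-code quantization; larger values increasingly replace the nearest code with a different one, which mimics imperfect token sequences.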


Next, we verify whether the Noise-Resistant Decoder indeed enhances the transformer’s performance 
during inference. In this test, dense meshes derived from corrupted ground truth (GT) meshes serve as the 
condition for generating new meshes. The generated meshes are then assessed for shape alignment 
with the conditioning shape. As shown in Tab. 2, the model with the Noise-Resistant Decoder achieves 
better results. 


Comparison with Marching Cubes and Remesh. Our method is related to various mesh extraction 
methods [42, 13, 12, 53] since we all convert other 3D representations into meshes. However, it 
is important to note that previous approaches are reconstruction-like methods that produce dense 
meshes, while our approach is generative, creating Artist-Created Meshes (AMs) that are significantly 
more complex to produce than dense meshes. Current metrics measure only the quality of shape 
alignment and do not effectively reflect the topological advantages of our method. Among the 
numerous mesh extraction methods, we chose the most representative one, Marching Cubes [42], for 
comparison. 


We compare the results of mesh extraction using Marching Cubes [42] and its combination with 
Blender remesh [4]. Since our evaluation dataset includes non-watertight meshes, we first extract the 
signed distance fields (SDF) of all ground truth meshes using [63], which can handle non-watertight 
meshes. We then apply Marching Cubes with a resolution of 128 on these SDFs. Next, we apply 
Blender remesh [4] with different voxel sizes to the Marching Cubes results, as both the remesh 
method and our approach are capable of simplifying topology. Additionally, the Marching Cubes 
result is used as the shape condition input to MeshAnything to obtain our results. 


As shown in Tab. 3, these methods require hundreds of times more faces to achieve 
results comparable to our method. Furthermore, our approach achieves the best performance in Edge 
Chamfer Distance (ECD), even with significantly fewer faces than other methods. Comparing (a) 
and (g), our method lags in Chamfer Distance (CD) and Normal Consistency (NC), mainly due to 
our method’s inherent failure cases as a generative model, which makes it less robust than Marching 
Cubes. When comparing with remesh methods, we observe that they incur a high cost to achieve a 
face count similar to ours. Comparing (f) and (g), we find that even when remesh methods achieve 
a comparable face count, the number of vertices is still several times higher than ours, indicating 
that the topology efficiency of remesh methods is far inferior to ours, as they completely ignore the 
shape characteristics of the 3D assets. Surprisingly, our method can also produce 
results with fewer faces than the ground truth, demonstrating that MeshAnything is not overfitting to 
the data but instead learns an efficient topology representation, occasionally surpassing the ground 
truth meshes. 


6 Limitations 


Our method cannot generate meshes that exceed the maximum face count limit, so it cannot convert 
large scenes and particularly complex objects into meshes. Additionally, due to its generative nature, 
our method is not as stable as methods like Marching Cubes [42]. 


7 Social Impact 


Our method points to a promising approach for the automatic generation of Artist-Created Meshes, 
which has the potential to significantly reduce labor costs in the 3D industry, thereby facilitating 
advancements in industries such as gaming, film, and the metaverse. However, lowering the cost of 
obtaining artist-created 3D meshes could also enable misuse, including potential criminal activities. 


8 Conclusion 


In this work, we propose a novel setting for improved mesh extraction and mesh generation, namely 
Shape-Conditioned Artist-Created Mesh (AM) Generation. Following this setting, we introduce 
MeshAnything, a model capable of generating AMs that adhere to given 3D assets. MeshAnything can 
convert 3D assets in any 3D representation into AMs and thus can be integrated with diverse 3D asset 
production methods to facilitate their application in the 3D industry. Furthermore, we introduce a 
noise-resistant decoder architecture to enhance the generation quality, enabling the model to handle 
low-quality token sequences produced by autoregressive transformers. Lastly, extensive experiments 
demonstrate the superior performance of our method, highlighting its potential to scale up for 3D 
industry application and its advantage over previous methods. 